Abstract

Kernel methods are ubiquitous tools in machine learning. They have proven to be
effective in many domains and tasks. Yet, kernel methods often require the user to
select a predefined kernel to build an estimator with. However, there is often little
reason for the a priori selection of a kernel. Even if a universal approximating kernel
is selected, the quality of the finite sample estimator may be greatly affected
by the choice of kernel. Furthermore, when directly applying kernel methods, one
typically needs to compute an N x N Gram matrix of pairwise kernel evaluations
to work with a dataset of N instances. The computation of this Gram matrix precludes
the direct application of kernel methods on large datasets. In this paper we
introduce Bayesian nonparametric kernel (BaNK) learning, a generic, data-driven
framework for scalable learning of kernels. We show that this framework can
be used for performing both regression and classification tasks and scales to large
datasets. Furthermore, we show that BaNK outperforms several other scalable
approaches for kernel learning on a variety of real-world datasets.


1 Introduction 

Kernel methods have become a staple approach to modern machine learning. They, along with the 
representer theorem, have given a tractable way to optimize a myriad of machine learning tasks 
over broad function classes. Indeed, kernel methods like support vector machines (SVMs), kernel
ridge regression, kernel PCA, and many others have been shown to be effective on a wide range of
domains and applications. 

However, despite kernel methods being a standard and well-studied tool in machine learning, one
commonly overlooked aspect is the choice of kernel to use. Typically, one is forced to make a
specific choice of kernel function a priori; once fixed, the choice of kernel will induce a
reproducing kernel Hilbert space (RKHS) that optimization occurs over. Even when cross-validating
kernel parameters, the function class is limited to an RKHS induced by the often arbitrary choice
of the family of kernels being cross-validated. Yet, the choice of kernel can greatly impact
performance. Indeed, even when using a universal approximating kernel, the finite sample estimate
will be affected by the kernel choice. Given that the choice of kernel is an important free
parameter in kernel methods, and that there is generally little a priori reason for a particular
kernel selection, a principled and data-driven method for learning kernels is extremely useful.

Moreover, another drawback to kernel methods is that they often do not scale to datasets with a
large number of instances. This is because kernel methods typically require the computation of
a large Gram matrix of kernel evaluations for all pairs of instances in a dataset. That is, when
optimizing over a dataset of $N$ instances using a kernel $K$, a Gram matrix
$\mathbf{K} \in \mathbb{R}^{N \times N}$ needs to be computed, where $\mathbf{K}_{ij} = K(x_i, x_j)$
and the $x_i$'s are input covariates. When $N$ is in the many thousands or more, the computation
of $\mathbf{K}$ will be prohibitive. Furthermore, kernel methods will often require manipulations
of $\mathbf{K}$ such as taking inverses, which will result in a worse time complexity than the
$O(N^2)$ time required for computing $\mathbf{K}$. Considering that modern datasets are only
increasing in size, and complicated machine learning tasks require large datasets for achieving a
low risk, it is vital to mitigate the high computational cost of kernel methods.

*These two authors had equal contribution.

In order to provide a method that scales to large datasets and adaptively learns the kernel to use 
in a data-driven fashion, this paper presents the Bayesian nonparametric kernel-learning (BaNK) 
framework. BaNK is a novel approach that will use random features to both provide a scalable 
solution and learn kernels. 

Random features have recently been shown to be an effective way to scale kernel methods to large
datasets. Roughly speaking, random feature techniques like random kitchen sinks (RKS) [?] work as
follows. Given a shift-invariant kernel $K(x, x') = k(x - x')$, one constructs an approximate
primal space to estimate kernel evaluations $K(x, x')$ as the dot product of finite vectors
$\varphi(x)^T \varphi(x')$. The vectors $\varphi$ are constructed with random frequencies drawn
from a distribution $\mathcal{D}$ that is defined by $K$. Conversely, a distribution $\mathcal{D}$
from which random frequencies are drawn defines a kernel $K$ that the random frequencies
approximate. It is this last observation that is key for the BaNK framework. Whereas a typical use
of RKS will consider only a fixed distribution $\mathcal{D}$ of random frequencies, via the kernel
$K$, BaNK will allow the distribution $\mathcal{D}$ to vary with the given data, effectively
learning the kernel. In particular, BaNK shall vary $\mathcal{D}$ with a graphical model approach
where we treat $\mathcal{D}$ as a latent parameter and place a prior on it (Figure 1). The prior
on $\mathcal{D}$, along with the data generation model, will allow one to sample from a posterior
over $\mathcal{D}$ in order to learn the corresponding kernel.

Modeling $\mathcal{D}$ as a mixture of Gaussians with a Dirichlet process prior allows BaNK to
learn a kernel from a rich, broad class. Furthermore, with the use of random features, we are able
to efficiently sample the model parameters and work over larger datasets. Moreover, by using
Metropolis-Hastings sampling from a proper posterior, the kernels we learn are interpretable and
the random features are asymptotically guaranteed to come from the underlying posterior
distribution, unlike greedy non-convex optimization methods.

Outline The rest of this paper is structured as follows. First we review the use of random features 
for kernel approximation and show how such an approach can be used for flexible and efficient 
kernel learning. Second, we detail our graphical model framework both for supervised regression 
and classification tasks. Third, we expound on our inference method for sampling from the model 
posterior. Fourth, we illustrate the use and performance of BaNK for both regression and classification
on several datasets. Lastly we cover related works and give concluding remarks. 



Figure 1: (a) RKS: traditional random feature approach where the distribution $\mathcal{D}$ of
random features $W$ is held fixed. (b) BaNK: framework where $\mathcal{D}$ is taken to be random.


2 Model 

2.1 Random Features for Kernel Estimation 

Below we briefly review the method of random Fourier features for the approximation of kernels
[?]. The details of the method will help motivate and explain our BaNK model. Henceforth, we
will only consider continuous shift-invariant kernels defined over $\mathbb{R}^d$,
$K(x, y) = k(x - y)$, where $x, y \in \mathbb{R}^d$ and $k$ is a positive definite function. While
useful and innovative, the use of random Fourier features for kernel approximation is just a Monte
Carlo integration using Bochner's theorem [?]. Bochner's theorem states that a continuous
shift-invariant kernel $K(x, y) = k(x - y)$ is a positive definite function if and only if $k(t)$
is the Fourier transform of a non-negative measure $p(\omega)$. Note further that if $k(0) = 1$,
then $p(\omega)$ will be a normalized density. That is, if we define
$\zeta_\omega(x) = \exp(i \omega^T x)$, then

$$k(x - y) = \int_{\mathbb{R}^d} p(\omega) \exp\left(i \omega^T (x - y)\right) d\omega = \mathbb{E}_{\omega \sim p}\left[\zeta_\omega(x) \zeta_\omega(y)^*\right]. \qquad (1)$$

Hence, using Monte Carlo integration, we can approximate $K(x, y) = k(x - y)$ using
$\omega_j \overset{iid}{\sim} p$:

$$k(x - y) \approx \frac{1}{M} \sum_{j=1}^{M} \zeta_{\omega_j}(x) \zeta_{\omega_j}(y)^*. \qquad (2)$$

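In particular, if our kernel $k$ is real-valued, then we can discard the imaginary part of (2):

$$k(x - y) \approx \varphi(x)^T \varphi(y), \quad \varphi(x) = \frac{1}{\sqrt{M}} \left[\cos(\omega_1^T x), \ldots, \cos(\omega_M^T x), \sin(\omega_1^T x), \ldots, \sin(\omega_M^T x)\right]^T. \qquad (3)$$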

The great advantage of such an approximation is that we may now estimate a function in the RKHS
as a linear operator in the random features:
$f(x) = \sum_{i=1}^{N} \alpha_i K(x_i, x) \approx \sum_{i=1}^{N} \alpha_i \varphi(x_i)^T \varphi(x) = \psi^T \varphi(x)$,
where $\psi = \sum_{i=1}^{N} \alpha_i \varphi(x_i)$. Thus we may work directly in a primal space of
$\varphi(x)$ and avoid computing large Gram matrices. To recap, using the approximation of kernels
with random features works as follows: choose a kernel defined by $k$ (with $k(0) = 1$); take its
Fourier transform $p(\omega)$, which will be a pdf over $\mathbb{R}^d$; draw $M$ i.i.d. samples
$\omega_1, \ldots, \omega_M$ from $p(\omega)$; estimate the kernel with
$K(x, y) \approx \varphi(x)^T \varphi(y)$ as in (3).
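The recap above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: it uses the RBF kernel with bandwidth `sigma` (whose spectral density is Gaussian), and the helper names `sample_frequencies`, `features`, and `rbf_kernel` are hypothetical, not taken from the paper.

```python
import math
import random

def sample_frequencies(M, d, sigma, rng):
    """Draw M frequencies w_j ~ N(0, sigma^{-2} I), the spectral density of
    the RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(M)]

def features(x, freqs):
    """phi(x) = M^{-1/2} [cos(w_1^T x), ..., cos(w_M^T x), sin(w_1^T x), ..., sin(w_M^T x)]."""
    proj = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in freqs]
    scale = 1.0 / math.sqrt(len(freqs))
    return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]

def rbf_kernel(x, y, sigma):
    """Exact RBF kernel value, used only to check the approximation."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

rng = random.Random(0)                       # fixed seed so the sketch is repeatable
freqs = sample_frequencies(M=2000, d=3, sigma=1.5, rng=rng)
x, y = [0.3, -1.0, 0.8], [1.1, 0.2, -0.4]    # arbitrary illustrative inputs
approx = sum(a * b for a, b in zip(features(x, freqs), features(y, freqs)))
exact = rbf_kernel(x, y, 1.5)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The dot product of the two feature vectors is the Monte Carlo average of $\cos(\omega_j^T (x - y))$, so its error shrinks at the usual $1/\sqrt{M}$ rate; no Gram matrix is ever formed.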

However, Bochner's theorem also allows one to work in the other direction. That is, we may start
with a distribution $\mathcal{D}$ with pdf $p(\omega)$ and take the characteristic function (the
inverse Fourier transformation) to define a shift-invariant kernel $k$. For example, suppose that
$p(\omega) = \mathcal{N}(\omega | \mu, \Sigma)$, where $\mathcal{N}(\omega | \mu, \Sigma)$ is the
pdf of $\mathcal{N}(\mu, \Sigma)$. Taking its characteristic function we see that
$k(t) = \exp\left(i \mu^T t - \frac{1}{2} t^T \Sigma t\right)$ would be the corresponding
shift-invariant kernel. From the kernel learning perspective, Bochner's theorem yields an object
to manipulate for the learning of one's kernel: $p(\omega)$, the distribution of random features.

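We consider distributions that are mixtures of Gaussians:

$$p(\omega) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\omega | \mu_k, \Sigma_k) \iff k(t) = \sum_{k=1}^{K} \pi_k \exp\left(i \mu_k^T t - \frac{1}{2} t^T \Sigma_k t\right). \qquad (4)$$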

This makes for very general kernels. In fact, i) noting that Gaussian mixture models are universal
approximators of densities and may hence approximate any spectral distribution, and ii) using
Plancherel's theorem to relate spectral accuracies to the original domain, it follows that:

Proposition 2.1. The expression of $p(\omega)$ in (4) can approximate any shift-invariant kernel.


For the applications we consider we only need real-valued kernels, hence we approximate the real
part of (4):

$$K(x, y) = \sum_{k=1}^{K} \pi_k \exp\left(-\frac{1}{2} (x - y)^T \Sigma_k (x - y)\right) \cos\left(\mu_k^T (x - y)\right) \approx \varphi(x)^T \varphi(y), \qquad (5)$$

where again $\varphi(x) = \frac{1}{\sqrt{M}} [\cos(\omega_1^T x), \ldots, \cos(\omega_M^T x), \sin(\omega_1^T x), \ldots, \sin(\omega_M^T x)]^T$
with $\omega_j \overset{iid}{\sim} p$. An application of the random feature approximation bounds
found in [?] yields that:
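Proposition 2.2. For compact $\mathcal{X} \subset \mathbb{R}^d$ with finite diameter, we have that

$$\Pr\left[\sup_{x, y \in \mathcal{X}} \left|K(x, y) - \varphi(x)^T \varphi(y)\right| \ge \epsilon\right] = O\left(\frac{1}{\epsilon^2} \exp\left(\frac{-M \epsilon^2}{4(d + 2)}\right)\right).$$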

Using the above, it may be seen that one can effectively approximate shift-invariant kernels using
random features drawn from Gaussian mixtures. However, in order to learn the kernel, one still
needs a mechanism to determine the Gaussian mixture to use. We take a graphical model approach
to determine the mixture for $p(\omega)$ in a principled, data-driven fashion. The concept of using
a nonparametric mixture prior on $p$ with random frequencies was considered in [?] without
empirical details.
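The correspondence between a Gaussian-mixture spectral density and the kernel in (5) can be checked numerically. The following is a minimal 1-d sketch using only the Python standard library; the mixture weights, means, and variances are arbitrary illustrative values, and the function names are hypothetical.

```python
import math
import random

def mixture_kernel(t, weights, means, variances):
    """Real part of the mixture kernel in 1-d:
    sum_k pi_k * exp(-0.5 * var_k * t^2) * cos(mu_k * t)."""
    return sum(p * math.exp(-0.5 * v * t * t) * math.cos(m * t)
               for p, m, v in zip(weights, means, variances))

def sample_mixture_frequencies(M, weights, means, variances, rng):
    """Draw w_j from p(w) = sum_k pi_k N(w | mu_k, var_k)."""
    comps = rng.choices(range(len(weights)), weights=weights, k=M)
    return [rng.gauss(means[k], math.sqrt(variances[k])) for k in comps]

def rff_kernel(t, freqs):
    """phi(x)^T phi(y) with t = x - y reduces to the average of cos(w_j * t)."""
    return sum(math.cos(w * t) for w in freqs) / len(freqs)

# Illustrative two-component spectral mixture (values chosen for the example only).
weights, means, variances = [0.5, 0.5], [0.0, 2.0], [0.25, 0.25]
rng = random.Random(0)
freqs = sample_mixture_frequencies(5000, weights, means, variances, rng)
for t in (0.0, 0.5, 1.0, 2.0):
    print(t, round(mixture_kernel(t, weights, means, variances), 3),
          round(rff_kernel(t, freqs), 3))
```

Changing the mixture parameters changes the kernel that the same feature construction approximates, which is the degree of freedom a kernel-learning procedure can exploit.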


2.2 Graphical Model 

Graphical models have served as an important and effective tool in the learning of latent
parameters for a multitude of applications [?]. Hence, we find graphical models to be an
incredibly natural tool for kernel learning. As described above, one may vary and tune kernels
with the choice of density over random features, $p(\omega)$. Thus, in our model, we take this
distribution itself to be a random, latent parameter. It is interesting to note that although it
is often restrictive to set a generative process for one's data, the BaNK model, with a stochastic
random frequency distribution, is strictly more general than the traditional kernel method
approach of using a fixed RBF kernel. Roughly speaking, the BaNK model will consist of three major
parts: one, a prior for stochastically generating the random feature distribution $p(\omega)$;
two, a prior for the generation of the parameters of a linear model in the primal space of random
features; three, a generative data model with noise to generate labels given input covariates and
the rest of the parameters.

As previously mentioned, a robust and flexible choice of $p(\omega)$ is a Gaussian mixture model;
we generate it as follows. Since the number of modes of $p(\omega)$ is not known a priori, we will
assume it to be infinite, but given a finite dataset the model will realize only a finite number of
Gaussians in the mixture. Hence we model the distribution as
$p(\omega) = \sum_{k=1}^{\infty} \pi_k \mathcal{N}(\omega | \mu_k, \Sigma_k)$, where
$\sum_{k=1}^{\infty} \pi_k = 1$ and $\pi_k \ge 0$. We use a Dirichlet process (DP) prior on the
components of our Gaussian mixture. The Dirichlet process is a distribution over discrete
probability measures (i.e., atoms), $G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}$, with
countably infinite support, where the finite-dimensional marginals are distributed according to
a finite Dirichlet distribution [?]. It is parametrized by a base probability measure $H$, which
determines the distribution of the atom locations, and a concentration parameter $\alpha > 0$ that
is proportional to the inverse variance of the atom locations. The DP can be used as the
distribution over mixing measures in a nonparametric mixture model. While the DP allows for an
infinite number of clusters a priori, any finite dataset will be modeled using a finite, but
random, number of clusters. We sample the mixture weights from a stick-breaking prior, which
produces samples distributed according to a DP. Thus $\pi \sim \mathrm{GEM}(\alpha)$, where GEM is
the stick-breaking prior. We also put a Normal-Inverse-Wishart prior (which acts as the base
measure $H$) on the mean $\mu_k$ and covariance $\Sigma_k$ of each of the Gaussian components.

Above we discussed how functions in a kernel's RKHS can be approximated using a linear mapping
in the random features; thus we consider models that operate linearly in the random features using
a vector $\beta \in \mathbb{R}^{2M}$. As is standard in Bayesian regression and classification
models [?], we generate $\beta$ from a Normal prior, $\beta \sim \mathcal{N}(\mu_\beta, \sigma_\beta I)$.

Lastly, we generate our observations given a dataset $X := (x_1, \ldots, x_N)^T$ where each
$x_i \in \mathbb{R}^d$. For example, in regression tasks we have:

$$y = g(x) + \epsilon \qquad (6)$$

where $g$ is approximated by $\beta^T \varphi(x)$ and $\varphi(x)$ is calculated using (3). Thus
$y \sim \mathcal{N}(\beta^T \varphi(x), \sigma^2)$.

The complete generative model is given below and the corresponding plate diagram is shown in
Figure 2.

1. Draw the mixture weights $\pi$ over components of the kernel, $\pi \sim \mathrm{GEM}(\alpha)$.

2. Draw the mixture components from a Normal-Inverse-Wishart distribution, i.e. draw
$\Sigma_k \sim \mathcal{W}^{-1}(\Psi_0, \nu_0)$ and $\mu_k \sim \mathcal{N}(\mu_0, \frac{1}{\kappa_0} \Sigma_k)$
for $k = 1, \ldots, \infty$.

3. For each random frequency index $j = 1, \ldots, M$

(a) Draw the component from which the frequency vector is drawn, $Z_j \sim \mathrm{Mult}(\pi)$.

(b) Draw the corresponding random frequency vector $W_j \sim \mathcal{N}(\omega | \mu_{Z_j}, \Sigma_{Z_j})$.

4. Draw the weight vector, $\beta \sim \mathcal{N}(\mu_\beta, \sigma_\beta I)$.

5. For each data point index $i = 1, \ldots, N$

(a) Define $\varphi(X_i)$ as in (3).

(b) Draw the observation, e.g. for regression: $Y_i \sim \mathcal{N}(\varphi(X_i)^T \beta, \sigma^2)$.

We note that the only change when going from regression to classification is in step 5(b) of the
generative procedure, where we instead draw the label through a sigmoid.

5. For each data point index $i = 1, \ldots, N$

(b) Draw the output binary label $Y_i$ with $P(Y_i = 1) = \sigma(\varphi(X_i)^T \beta)$, where
$\sigma(x) = \frac{1}{1 + \exp(-x)}$.
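To make the generative procedure concrete, here is a hedged 1-d sketch in standard-library Python. Several simplifications are assumptions of this example rather than part of the model: the stick-breaking construction is truncated at `K_max` components, the prior mean of $\beta$ is taken to be zero, the Inverse-Wishart reduces to an inverse-gamma in one dimension, and all hyperparameter defaults are arbitrary.

```python
import math
import random

def stick_breaking(alpha, K_max, rng):
    """Truncated GEM(alpha) weights; the last stick absorbs the leftover mass."""
    weights, remaining = [], 1.0
    for _ in range(K_max - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights

def features(x, freqs):
    """1-d random features: M^{-1/2} [cos(w_j x) ..., sin(w_j x) ...]."""
    s = 1.0 / math.sqrt(len(freqs))
    return [s * math.cos(w * x) for w in freqs] + [s * math.sin(w * x) for w in freqs]

def generate(X, M, rng, alpha=1.0, K_max=20, mu0=0.0, kappa0=0.1,
             psi0=1.0, nu0=3.0, sigma_beta=1.0, noise_var=0.1):
    pi = stick_breaking(alpha, K_max, rng)                            # step 1
    # Step 2: in 1-d an Inverse-Wishart(psi0, nu0) variance is the reciprocal
    # of a Gamma(shape=nu0/2, scale=2/psi0) draw.
    var = [1.0 / rng.gammavariate(nu0 / 2.0, 2.0 / psi0) for _ in range(K_max)]
    mu = [rng.gauss(mu0, math.sqrt(v / kappa0)) for v in var]
    Z = rng.choices(range(K_max), weights=pi, k=M)                    # step 3(a)
    W = [rng.gauss(mu[z], math.sqrt(var[z])) for z in Z]              # step 3(b)
    # Step 4: sigma_beta is treated as a variance; zero prior mean is assumed.
    beta = [rng.gauss(0.0, math.sqrt(sigma_beta)) for _ in range(2 * M)]
    f = [sum(b * p for b, p in zip(beta, features(x, W))) for x in X]  # step 5(a)
    Y_reg = [rng.gauss(fi, math.sqrt(noise_var)) for fi in f]         # step 5(b), regression
    Y_clf = [int(rng.random() < 1.0 / (1.0 + math.exp(-fi))) for fi in f]  # step 5(b), classification
    return W, Z, beta, Y_reg, Y_clf

rng = random.Random(1)
X = [rng.gauss(0.0, 1.0) for _ in range(50)]
W, Z, beta, Y_reg, Y_clf = generate(X, M=30, rng=rng)
print(len(set(Z)), "mixture components were actually used")
```

Even with `K_max = 20` available components, only a handful are typically occupied by the `M` frequencies, which mirrors the remark that a finite dataset realizes only a finite number of Gaussians.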
Figure 2: Plate diagram for the graphical model for the BaNK learning framework.


3 Inference 

We propose an MCMC-based solution for inferring the parameters of the mixture of Gaussians
distribution that defines $p(\omega)$. This includes finding the component assignment vector $Z$
and the mean and covariance, $\mu_k$ and $\Sigma_k$, for each component. We will also sample the
random frequencies $W$ while marginalizing other parameters including $\pi$ and $\beta$. The
sampling equations for $Z$, $\mu$, and $\Sigma$ remain the same for both regression and
classification; the two tasks differ only when sampling $W$. We will first describe the part of
the solution which is common to both regression and classification, and then describe how to get
parameters specific to the two tasks.

We want to sample from $p(Z, \mu, \Sigma, W | X, Y, \text{rest})$, where rest denotes all the
hyperparameters of our model, while other parameters including $\beta$ and $\pi$ have been
integrated out. We use Gibbs sampling and sample each variable in turn given all other variables.

3.1 Sampling Zj 

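Recall that $Z_j$ indicates which component the random frequency $W_j$ is drawn from. We use the
Chinese restaurant process analogy to integrate out $\pi$, the component priors. Let
$m_k = \sum_{j=1}^{M} \delta(Z_j = k)$. The sampling equation for $Z_j$ can be derived from [?]
and is shown below:

$$P(Z_j = k | \mu, \Sigma, W, X, Y, \text{rest}) \propto \begin{cases} \frac{m_k^{-j}}{M - 1 + \alpha} \mathcal{N}(\omega_j | \mu_k, \Sigma_k) & \text{if } m_k^{-j} > 0 \\ \frac{\alpha}{M - 1 + \alpha} \mathcal{N}(\omega_j | \mu_k, \Sigma_k) & \text{if } m_k^{-j} = 0 \end{cases} \qquad (7)$$

where $m_k^{-j} = \sum_{i \ne j} \delta(Z_i = k)$ and $W_j = \omega_j$. For a new component
($m_k^{-j} = 0$), we sample the mean and the variance from the Normal-Inverse-Wishart prior.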
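The conditional for $Z_j$ can be sketched as follows for 1-d frequencies, using only the standard library. The function and argument names are hypothetical; `new_param` stands for a (mean, variance) pair that the caller has already drawn from the prior for the candidate new component.

```python
import math
import random

def normal_pdf(w, mean, var):
    """Density of N(mean, var) at w."""
    return math.exp(-0.5 * (w - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def assignment_probs(j, W, Z, params, alpha, new_param):
    """Normalised conditional of Z_j over occupied components plus one new one.

    params maps component label -> (mean, variance); new_param is the
    (mean, variance) pair drawn from the prior for the candidate new component.
    """
    M = len(W)
    counts = {}
    for i, z in enumerate(Z):
        if i != j:                       # counts excluding frequency j itself
            counts[z] = counts.get(z, 0) + 1
    labels, scores = [], []
    for k, m in counts.items():          # occupied components: weight m / (M - 1 + alpha)
        mean, var = params[k]
        labels.append(k)
        scores.append(m / (M - 1 + alpha) * normal_pdf(W[j], mean, var))
    labels.append("new")                 # new component: weight alpha / (M - 1 + alpha)
    scores.append(alpha / (M - 1 + alpha) * normal_pdf(W[j], *new_param))
    total = sum(scores)
    return labels, [s / total for s in scores]

# Toy state: five frequencies, two occupied components (illustrative values).
W = [-0.1, 0.2, 3.1, 2.9, 0.05]
Z = [0, 0, 1, 1, 0]
params = {0: (0.0, 0.25), 1: (3.0, 0.25)}
labels, probs = assignment_probs(4, W, Z, params, alpha=1.0, new_param=(1.0, 1.0))
new_label = random.Random(0).choices(labels, weights=probs)[0]   # Gibbs draw for Z_4
print(labels, [round(p, 3) for p in probs], new_label)
```

Because the counts exclude frequency $j$, a component whose only member is $W_j$ drops out of the candidate list automatically and can only be re-created through the "new" option.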

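3.2 Sampling $\mu_k$ and $\Sigma_k$

Given the component assignment $Z$ and the random frequencies $W$, the posterior distribution of
the covariance of each Gaussian component in the mixture is Inverse-Wishart, i.e.
$\Sigma_k \sim \mathcal{W}^{-1}(\Psi_k, \nu_k)$, where

$$\Psi_k = \Psi_0 + \sum_{j : Z_j = k} (W_j - \bar{W}^k)(W_j - \bar{W}^k)^T + \frac{\kappa_0 m_k}{\kappa_0 + m_k} (\bar{W}^k - \mu_0)(\bar{W}^k - \mu_0)^T,$$

with $\bar{W}^k = \frac{1}{m_k} \sum_{j : Z_j = k} W_j$ and $\nu_k = \nu_0 + m_k$. Similarly, the
posterior distribution of $\mu_k$ given $\Sigma_k$, $Z$, and $W$ is normal; i.e.
$\mu_k \sim \mathcal{N}(\hat{\mu}_k, \frac{1}{\kappa_k} \Sigma_k)$, where
$\hat{\mu}_k = \frac{\kappa_0 \mu_0 + m_k \bar{W}^k}{\kappa_0 + m_k}$ and $\kappa_k = \kappa_0 + m_k$.
See [?] for more details.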
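As a small worked instance of this conjugate update, the sketch below computes the posterior parameters for one component in 1-d and then draws a (mean, variance) pair. It assumes scalar frequencies, so the Inverse-Wishart becomes an inverse-gamma; the function names are hypothetical.

```python
import math
import random

def niw_posterior(ws, mu0, kappa0, psi0, nu0):
    """Posterior (mu_hat, kappa_k, psi_k, nu_k) for the 1-d frequencies ws of one component."""
    m = len(ws)
    w_bar = sum(ws) / m
    scatter = sum((w - w_bar) ** 2 for w in ws)
    kappa_k = kappa0 + m
    nu_k = nu0 + m
    mu_hat = (kappa0 * mu0 + m * w_bar) / kappa_k
    psi_k = psi0 + scatter + kappa0 * m / kappa_k * (w_bar - mu0) ** 2
    return mu_hat, kappa_k, psi_k, nu_k

def sample_component(ws, mu0, kappa0, psi0, nu0, rng):
    """Draw (mean, variance) for one component from its conditional posterior."""
    mu_hat, kappa_k, psi_k, nu_k = niw_posterior(ws, mu0, kappa0, psi0, nu0)
    # 1-d Inverse-Wishart(psi_k, nu_k): reciprocal of a Gamma(nu_k/2, scale=2/psi_k) draw.
    var = 1.0 / rng.gammavariate(nu_k / 2.0, 2.0 / psi_k)
    mean = rng.gauss(mu_hat, math.sqrt(var / kappa_k))
    return mean, var

# Three frequencies assigned to the component, with illustrative hyperparameters.
print(niw_posterior([1.0, 2.0, 3.0], mu0=0.0, kappa0=1.0, psi0=1.0, nu0=3.0))
print(sample_component([1.0, 2.0, 3.0], 0.0, 1.0, 1.0, 3.0, random.Random(0)))
```

For the three frequencies above the sample mean is 2 and the scatter is 2, so the update gives $\hat{\mu}_k = 1.5$, $\kappa_k = 4$, $\Psi_k = 1 + 2 + 3 = 6$, and $\nu_k = 6$.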

3.3 Sampling W 

We derive a Metropolis-Hastings (MH) sampler for sampling $W$. The posterior distribution of the
random frequencies $W$ given the assignment $Z$, the parameters of the components $\mu$ and
$\Sigma$, and the data $X$ and $Y$ is proportional to

$$P(W | Z, \mu, \Sigma, Y, X, \text{rest}) \propto P(W | Z, \mu, \Sigma) P(Y | X, W, \text{rest}). \qquad (8)$$

The first term on the right-hand side is a normal distribution,
$P(W | Z, \mu, \Sigma) = \prod_{j=1}^{M} \mathcal{N}(W_j | \mu_{Z_j}, \Sigma_{Z_j})$. Since it is
difficult to sample directly from the posterior, we use MH, where the first factor of the
right-hand side of (8) is used as a proposal distribution; i.e. $Q(W) = P(W | Z, \mu, \Sigma)$.
Now, the acceptance ratio for a newly proposed $W^*$ is given by
$r = \min\left\{1, \frac{P(Y | X, W^*, \text{rest})}{P(Y | X, W, \text{rest})}\right\}$.

Here the ratio is a ratio of model evidences and is calculated differently for regression and
classification. Note that the choice to marginalize out $\beta$ in this manner is intuitive; good
random features are those for which there exists a $\beta$ that models $Y$ well. Hence, we expect
a sampling scheme that marginalizes $\beta$ to be more efficient than one that does not (and
instead samples $W$ w.r.t. a current sample of $\beta$).
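For completeness, the reduction of the MH acceptance ratio to a ratio of model evidences follows in one step from the general MH formula once the proposal is taken to be the prior factor of the posterior:

```latex
\begin{align*}
r &= \min\left\{1,\;
   \frac{P(W^* \mid Z,\mu,\Sigma)\,P(Y \mid X, W^*, \text{rest})\;Q(W)}
        {P(W \mid Z,\mu,\Sigma)\,P(Y \mid X, W, \text{rest})\;Q(W^*)}\right\} \\
  &= \min\left\{1,\;
   \frac{P(Y \mid X, W^*, \text{rest})}{P(Y \mid X, W, \text{rest})}\right\},
  \qquad \text{since } Q(W) = P(W \mid Z,\mu,\Sigma).
\end{align*}
```

The prior terms cancel against the proposal terms, so only the marginal likelihoods of the data under the current and proposed frequencies need to be evaluated.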

3.3.1 Regression 

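For regression we make use of conjugacy between the prior for $\beta$ and the likelihood to get a
closed-form solution for $P(Y | X, W, \text{rest})$. In this case we sample $\sigma^2$ from
$\text{Inverse-Gamma}(a_0, b_0)$. The model evidence is then:

$$P(Y | X, W, \text{rest}) = \int_{\beta, \sigma^2} P(Y | W, X, \beta, \sigma^2) P(\beta) P(\sigma^2) \, d\beta \, d\sigma^2 \propto \frac{\Gamma(a_n)}{\Gamma(a_0)} \frac{b_0^{a_0}}{b_n^{a_n}} \sqrt{\frac{|\Lambda_0|}{|\Lambda_n|}} \qquad (9)$$

where $\Lambda_0 = \frac{1}{\sigma_\beta} I$, $\Phi(X) = (\varphi(X_1)^T \ldots \varphi(X_N)^T)^T$,
$\Lambda_n = \Phi(X)^T \Phi(X) + \Lambda_0$, $\mu_n = \Lambda_n^{-1} (\Lambda_0 \mu_\beta + \Phi(X)^T Y)$,
$a_n = a_0 + \frac{N}{2}$, and $b_n = b_0 + \frac{1}{2} (Y^T Y + \mu_\beta^T \Lambda_0 \mu_\beta - \mu_n^T \Lambda_n \mu_n)$.
For more details refer to [?]. It is worth noting that one may efficiently compute ratios of model
evidences if proposing a single $W_j$ at a time; this is because doing so amounts to making
low-rank updates on $\Phi(X)^T \Phi(X)$.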
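The following standard-library sketch shows one way the evidence ratio could drive a single-frequency MH step for 1-d inputs. It is illustrative only: the prior precision is assumed to be `lam0 * I`, the prior mean is a constant vector, the evidence is recomputed from scratch with a small Cholesky routine instead of the low-rank updates mentioned above, and all names are hypothetical.

```python
import math
import random

def cholesky(A):
    """Lower-triangular L with L L^T = A for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_evidence(Phi, Y, mu_beta=0.0, lam0=1.0, a0=1.0, b0=1.0):
    """log of the evidence up to a W-independent constant, assuming
    Lambda_0 = lam0 * I and prior mean mu_beta * 1 (illustrative choices)."""
    N, D = len(Phi), len(Phi[0])
    # Lambda_n = Phi^T Phi + Lambda_0
    Lam = [[sum(Phi[i][r] * Phi[i][c] for i in range(N)) + (lam0 if r == c else 0.0)
            for c in range(D)] for r in range(D)]
    # rhs = Lambda_0 mu_beta + Phi^T Y
    rhs = [lam0 * mu_beta + sum(Phi[i][r] * Y[i] for i in range(N)) for r in range(D)]
    L = cholesky(Lam)
    # Forward substitution u = L^{-1} rhs, so mu_n^T Lambda_n mu_n = u^T u.
    u = []
    for r in range(D):
        u.append((rhs[r] - sum(L[r][k] * u[k] for k in range(r))) / L[r][r])
    quad = sum(v * v for v in u)
    a_n = a0 + N / 2.0
    b_n = b0 + 0.5 * (sum(y * y for y in Y) + lam0 * D * mu_beta ** 2 - quad)
    log_det_ratio = D * math.log(lam0) - 2.0 * sum(math.log(L[r][r]) for r in range(D))
    return (math.lgamma(a_n) - math.lgamma(a0) + a0 * math.log(b0)
            - a_n * math.log(b_n) + 0.5 * log_det_ratio)

def features(x, W):
    """1-d random features for the frequency list W."""
    s = 1.0 / math.sqrt(len(W))
    return [s * math.cos(w * x) for w in W] + [s * math.sin(w * x) for w in W]

def mh_step(W, j, X, Y, mean, var, rng):
    """Propose W_j* from its Gaussian component; accept with the evidence ratio."""
    proposal = list(W)
    proposal[j] = rng.gauss(mean, math.sqrt(var))
    log_r = (log_evidence([features(x, proposal) for x in X], Y)
             - log_evidence([features(x, W) for x in X], Y))
    accept = rng.random() < math.exp(min(0.0, log_r))
    return (proposal if accept else W), accept
```

A quick sanity check is that a frequency able to represent the target exactly scores a higher evidence than an uninformative one: for `Y = sin(2x)`, `log_evidence` with `W = [2.0]` exceeds the value with `W = [0.0]`.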


3.3.2 Classification 


The absence of conjugacy makes it difficult to directly estimate the model evidence
$P(Y | X, W, \text{rest})$, so instead we estimate the ratio of model evidences. The ratio is
computed by approximating the ratios of partition functions (see [?], chapter 11.6, p. 554, for
details):

$$r = \min\left(1, \frac{1}{L} \sum_{l=1}^{L} \frac{P(Y | X, W^*, \beta_l)}{P(Y | X, W, \beta_l)}\right), \qquad (10)$$

where $\beta_l \sim P(\beta | Y, X, W)$, $l = 1, \ldots, L$. Since the Gaussian prior is not
conjugate to the sigmoid likelihood, we approximate the posterior $P(\beta | Y, X, W)$ with a
Gaussian (see [?] for details on the quadratic approximation). The mean ($\beta_0$) of the
Gaussian is the mode of the posterior $P(\beta | Y, X, W)$, which can be calculated using
numerical optimization. By using the Laplace approximation of the posterior, the inverse of the
covariance can be found as $S_N^{-1} = -\nabla \nabla \log P(\beta | Y, W, X, \text{rest})$.*


4 Experiments 

We illustrate the use and performance of BaNK for both regression and classification on synthetic 
and real-world datasets below. 


4.1 Synthetic Data 


We give a simple 1-d kernel learning illustration with BaNK using synthetic data. We consider the
shift-invariant kernel

$$k(t) = \exp(\ldots) \left(\tfrac{1}{2} + \tfrac{1}{2} \cos(\ldots \pi t)\right);$$

that is, the kernel whose random frequency distribution is
$p(\omega) = \tfrac{1}{2} \mathcal{N}(\omega | 0, \ldots) + \tfrac{1}{2} \mathcal{N}(\omega | \ldots \pi, \ldots)$
(see Figure 3). We look to learn the underlying kernel using 250 frequencies. We generated
$N = 1000$ instances $D = \{(X_i, Y_i)\}_{i=1}^{N}$ where $X_i \sim \mathcal{N}(0, 4^2)$ and
$Y_i \sim \mathcal{N}(\varphi_p(X_i)^T \beta, 1)$, with $\varphi_p$ being the random features from
the kernel's true spectral distribution, $\omega_j \sim p$, and $\beta \sim \mathcal{N}(0, I)$. As
explained above, using BaNK one may estimate $p$ by drawing from the posterior. We plot such a
draw in Figure 3(b). One can see that even though the underlying spectral distribution is
multimodal, and the kernel is not easily discernible to the human eye based on the data plot
(Figure 3(a)), BaNK approximates the kernel rather well.

Figure 3: (a) Synthetic data used. (b) Kernel approximation: true $k$ in dashed red, $k$ estimated
with the true spectral distribution $p$ in cyan, and the BaNK estimate in blue.

*See http://www.cedar.buffalo.edu/~srihari/CSE574/Chap4/Chap4-Part5.pdf for details of the
derivation.


Table 1: Regression MSE on UCI MLR.

Dataset    N        d    RKS               BaNK                        AlaC              MKL
concrete   1030     8    0.1295 ± 0.0088   0.1204 ± 0.0094             0.2107 ± 0.0592   0.0978 ± 0.0079
noise      1503     5    0.6945 ± 0.0116   0.5013 ± 0.1107             0.7818 ± 0.0689   0.2936 ± 0.2936
prop       11934    16   0.0069 ± 0.0005   2.4 x 10^-? ± 1.8 x 10^-?   0.0019 ± 0.0002   0.0001 ± 6.4 x 10^-?
bike       17379    12   0.2056 ± 0.0034   0.0589 ± 0.0021             0.0648 ± 0.0044   0.0935 ± 0.0026
tom's      28179    96   0.0796 ± 0.0344   0.0153 ± 0.0061             0.0811 ± 0.0308   0.1007 ± 0.0320
music      515345   90   0.8488 ± 0.0027   0.7386 ± 0.0041             0.6827 ± 0.0033   0.8078 ± 0.0017
twitter    583250   77   0.3819 ± 0.0511   0.0777 ± 0.0214             0.3168 ± 0.0877   0.4333 ± 0.0588


4.2 Regression 

Below we run experiments with various real-world datasets found in the UCI machine learning
repository (UCI MLR). We compare BaNK to the straightforward random feature approach with a
fixed kernel as well as other competitive random-feature-based kernel learning methods. In
particular we compare to the following methods:

RKS For this method we take input covariates to be random features $\varphi(x_i)$ as in (3). Here
we take the random frequencies $\omega_j \sim \mathcal{N}(0, \sigma^{-2} I)$. This corresponds to
approximating the RBF kernel:

$$K(x_i, x_l) = \exp\left(-\frac{1}{2\sigma^2} \|x_i - x_l\|^2\right).$$

MKL One of the most widely used approaches to kernel learning is multiple kernel learning (MKL)
[?]. Here, one attempts to learn a kernel using a non-negative linear combination of a fixed bank
of kernels. That is, MKL attempts to learn a kernel $K$:

$$K(x_i, x_l) = \sum_{m=1}^{M} \alpha_m K_m(x_i, x_l), \quad \text{where } \alpha_m \ge 0, \qquad (11)$$

and $K_1, \ldots, K_M$ are predefined kernels. The kernel weights $\alpha_m$ would then be
optimized according to one's loss. Note that (11) still requires the computation of an $N \times N$
Gram matrix; in fact, it requires $M$ such Gram matrices. However, we extend MKL to use random
features and scale to larger datasets. If $K_m(x_i, x_l) \approx \varphi_m(x_i)^T \varphi_m(x_l)$, then

$$K(x_i, x_l) \approx \sum_{m=1}^{M} \alpha_m \varphi_m(x_i)^T \varphi_m(x_l) = \varphi(x_i)^T \varphi(x_l), \qquad (12)$$

where $\varphi(x_i) = [\sqrt{\alpha_1} \varphi_1(x_i)^T, \ldots, \sqrt{\alpha_M} \varphi_M(x_i)^T]^T$.
Hence, it is possible to work directly over input covariates
$[\varphi_1(x_i)^T, \ldots, \varphi_M(x_i)^T]^T$, the concatenation of the random features for
each kernel $K_1, \ldots, K_M$. We take our bank of kernels to be Laplace, RBF, and Cauchy kernels
at various scalings.
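The identity behind the random-feature version of MKL, namely that scaling each feature block by the square root of its weight reproduces the weighted sum of the approximate kernels, can be verified directly. The sketch below is a 1-d illustration with two frequency banks and made-up weights; all names and values are assumptions of the example.

```python
import math
import random

def features(x, freqs):
    """1-d random features for one kernel in the bank."""
    s = 1.0 / math.sqrt(len(freqs))
    return [s * math.cos(w * x) for w in freqs] + [s * math.sin(w * x) for w in freqs]

def mkl_features(x, banks, alphas):
    """Concatenate sqrt(alpha_m) * phi_m(x) over the kernels in the bank."""
    out = []
    for freqs, a in zip(banks, alphas):
        out.extend(math.sqrt(a) * v for v in features(x, freqs))
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(0)
# Two frequency banks: Gaussian draws (RBF kernel) and Cauchy draws (Laplace kernel).
banks = [[rng.gauss(0.0, 1.0) for _ in range(100)],
         [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(100)]]
alphas = [0.7, 0.3]                      # illustrative non-negative kernel weights
x, y = 0.4, -1.3
lhs = dot(mkl_features(x, banks, alphas), mkl_features(y, banks, alphas))
rhs = sum(a * dot(features(x, f), features(y, f)) for f, a in zip(banks, alphas))
print(lhs, rhs)
```

Since the weights enter only as per-block scalings, a linear model fit on the unweighted concatenation can absorb them into its coefficients, which is why the concatenated features suffice as input covariates.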

AlaC Very recently, independent work by [?] has considered an optimization approach, called A
la Carte, to learning a mixture of kernels. Here, an unconstrained, unpenalized, and non-convex
problem is posed for regression and optimized over the parameters of the mixture model and linear
weights. Specifically, we optimized the following problem based on [?]:

^ N f K D 'I 2 

min 2 X] 1 MkWkj + Xf p.k'j + Ptf sin [xf MkWkj + Xf pk^j > 

P’"’ [ k^i j=i J 

+ ^(ir“ir + iir"f), 

where $\beta^{\cos}$, $\beta^{\sin}$, $\nu$, $M_k$, and $\mu_k$ are optimized; $W_{kj} \in \mathbb{R}^d$
are standard Gaussian vectors that are drawn before optimizing and held fixed.

^https://archive.ics.uci.edu/ml/index.html

^We note that [?] uses approximate Gaussian matrices in a method called Fastfood [?]. However, we
found no increase in speed for the matrices of sizes considered in the experiments (using the
implementation found in
http://www.mathworks.com/matlabcentral/fileexchange/49142-fastfood-kernel-expansions),
so we draw $W_{kj}$ from a Gaussian.

We perform 5-fold cross-validation (picking parameters on validation sets and reporting back the
error on test sets). The total number of random features was chosen to be 500. Below we report the
mean squared error (MSE) ± standard errors; for better interpretability, we standardized the output
responses. As per [?], we optimized AlaC using L-BFGS. We note that the optimization of the AlaC
problem above took on average nearly twice as long as our sampling, while BaNK achieved a lower
MSE than AlaC on a majority of the datasets.

4.3 Classification 

As previously mentioned, we may use the BaNK framework to perform kernel learning in
classification tasks. Below we illustrate the use of BaNK for classification and kernel learning
on real-world datasets from the UCI MLR. Furthermore, we compare the accuracies of the models
found with BaNK to traditional scalable kernel methods for classification, namely RKS and MKL
using a logistic model. We see that BaNK proves to be an effective method in classification as well.


Table 2: Classification Prediction Error on UCI MLR.

Dataset                         N       d    RKS               MKL               BaNK
PIMA                            768     8    0.2571 ± 0.0258   0.2468 ± 0.0229   0.2338 ± 0.0148
Diabetic Retinopathy Debrecen   1151    20   0.2638 ± 0.0205   0.3138 ± 0.0257   0.2603 ± 0.0173
Skin segmentation               2103    3    0.0313 ± 0.0057   0.1706 ± 0.1481   0.0209 ± 0.0028
EEG Eye State                   14980   15   0.1322 ± 0.0030   0.1366 ± 0.0764   0.0992 ± 0.0028
Statlog Space Shuttle           58000   9    0.0028 ± 0.0010   0.0018 ± 0.0001   0.0009 ± 0.0001


5 Related Works 

Given the poor scaling of kernel methods on datasets with many instances, several methods have
looked at approximations of kernels that bypass the computation of large Gram matrices. For
example, Nyström-based methods [?] look to give a fast-to-compute, low-rank approximation of the
Gram matrix. Other approaches for summarizing the Gram matrix by the removal of elements or rows
have been explored in [?]. Furthermore, approximations based on KD-trees have been explored
in [?]. In this paper, we work with kernel approximations based on random features, called Random
Kitchen Sinks [?]. As previously discussed, random feature approaches work using an empirical
estimate of kernels that stems from drawing features from the Fourier transform of positive
definite functions, which will be a distribution if properly scaled.

Kernel learning methods have also received attention due to the impact that kernel choice has on
performance. Indeed, even if one fixes a family of kernels to use (e.g. RBF kernels) one still has
to select the parameters of the kernels. This is often done with cross-validation or with methods
like [?]. A very popular approach to learning kernels in a more flexible class is multiple kernel
learning (MKL) [?]. As illustrated above, in MKL, one looks to learn a kernel that is the
non-negative linear combination of a bank of fixed kernels. However, naively applying MKL
approaches would still require the computation of several large Gram matrices; instead, as
previously discussed, one may combine random features to perform scalable MKL (see [?] for an
application of such ideas). Very recently, independent work has explored an optimization approach,
A la Carte, to learning parameters of a mixture of kernels, where one optimizes the parameters
generating random features in non-convex likelihood problems. See also [?] for another
optimization approach. Unfortunately, due to the non-convex nature of the optimization problem,
such approaches yield non-interpretable kernels, whereas BaNK yields draws from a well-defined
posterior. [?] considers kernels of the form in (4), with parameters chosen heuristically; [?]
mentions the possibility of putting a DP prior model on parameters, but does so without a sampling
algorithm and empirical details.


6 Conclusion 

In this paper we propose an efficient and general data-driven framework, BaNK, for learning of
kernels that scales to large datasets. By representing the spectral density using a nonparametric
mixture of Gaussians, we capture a large class of kernels that can be learned. We provide a
generative model for learning kernels while performing regression and classification tasks, and
propose novel MCMC-based sampling schemes to infer parameters of the mixtures. We show that our
proposed framework outperforms other scalable kernel learning methods on a variety of real-world
datasets in both classification and regression tasks.

^Using https://github.com/pcarbo/lbfgsb-matlab